Data Labeling Startups funded by Y Combinator (YC) 2026

May 2026

Browse 18 of the top Data Labeling startups funded by Y Combinator.

We also have a Startup Directory where you can search through over 5,000 companies.

  • Shofo
    Y Combinator W2026
    Active • 4 employees • San Francisco
    We are building the world’s largest video library. We've aggregated billions of videos into a searchable index and use agents to find and label the exact datasets a lab needs on demand. If a lab needs 100K hours of cooking videos where someone is holding a pan, with reasoning annotations on top, our agents search the index, extract the matching subset, route it through our labeling pipeline, and deliver a custom dataset in days, not months.
    data-labeling
    artificial-intelligence
    infrastructure
  • Sciloop
    Y Combinator F2025
    Active • 2 employees • San Francisco, CA, USA
    Sciloop creates expert-level math and physics problems that frontier AI models can't solve, then sells the data to AI labs for training and evaluation. Our problems are created by IPhO and IMO medalists — the top 0.01% of STEM talent globally. On our benchmark, models like GPT 5.4 Pro and Gemini 3.1 Pro score 0-5% on our hardest problems. We work with AI labs to supply continuous, fresh training data that pushes the frontier of mathematical and scientific reasoning. Founded by Bilal and Osman, International Physics Olympiad medalists from MIT with hands-on ML research experience at MIT CSAIL.
    artificial-intelligence
    big-data
    data-labeling
    marketplace
  • Panels
    Y Combinator S2025
    Active • 2 employees • San Francisco, CA, USA
    Panels is an audio data platform that delivers high-quality speech datasets from vetted, diverse contributors to power training and evaluation of foundational voice models.
    data-labeling
    b2b
    artificial-intelligence
  • Liva AI
    Y Combinator S2025
    Active • 2 employees • San Francisco, CA, USA
    Speech models trained on internet data still fail to sound realistic. We solve this by collecting targeted training data for model labs. We hope to create a world where AI feels more human.
    b2b
    big-data
    data-labeling
    marketplace
    artificial-intelligence
  • Besimple AI
    Y Combinator P2025
    Active • 6 employees • San Francisco
    We are building the data layer for AI, starting with audio. We start with data collection, curating our own proprietary set of diverse conversational data covering a wide range of languages, dialects, and accents. We then use expert human audio annotators and our own annotation platform to process audio data for Automatic Speech Recognition. With human-level transcription and diarization, our data helps push the audio model frontier. Today we have millions of hours of conversational data, and the collection is growing. If you need audio data for training or evaluating your voice models or voice agents, reach out! We offer flexible licensing deals that work for startups and enterprises, with minimal process. Audio data should besimple :)
    artificial-intelligence
    data-labeling
    aiops
  • Cartpole
    Y Combinator P2025
    Active • 1 employee • San Francisco, CA, USA
    We're creating reinforcement learning environments for training frontier models.
    reinforcement-learning
    ml
    ai
    data-labeling
  • Sureform
    Y Combinator P2025
    Active • 2 employees • Palo Alto, CA, USA
    We collect high-quality human data, across diverse interactions and environments, to help advance the next generation of multimodal AI and robotics models.
    data-labeling
    robotics
    marketplace
    artificial-intelligence
  • AfterQuery
    Y Combinator W2025
    Active • 30 employees • San Francisco, CA, USA
    AfterQuery is an applied research lab curating data solutions for frontier foundation model development. Serving every frontier AI lab.
    b2b
    artificial-intelligence
    ai
    big-data
    data-labeling
  • Sieve
    Y Combinator W2022
    Active • 18 employees • San Francisco, CA, USA
    Sieve is the only AI research lab exclusively focused on video data. Video already makes up 80% of internet traffic and has become the dominant medium driving creativity, communication, gaming, AR/VR, and robotics. Unlocking the ability to truly model video is the key to breakthroughs across all of these domains, but progress has been bottlenecked by one thing: high-quality training data. That’s where Sieve comes in. We bring together exabyte-scale video infrastructure, novel video understanding techniques, and dozens of diverse data sources to create datasets that push the frontier of video modeling. This combination lets us deliver data with unmatched precision, quality, and speed, which has earned the trust of frontier AI labs, Fortune 100 companies, and fast-growing generative AI startups.
    video
    developer-tools
    ai
    data-engineering
    data-labeling
  • Spade
    Y Combinator W2022
    Active • 25 employees • New York, NY, USA
    Spade is the next generation of fintech infrastructure. We’re building a financial data enrichment API, purpose-built to help our customers uncover the truth hidden within their transaction data. We use our vast, ground-truth merchant data set to decipher cryptic transactions, helping customers underwrite, detect fraud, build better banking infrastructure, and gain a unique understanding of their users’ spending habits.
    fintech
    machine-learning
    payments
    data-labeling
    ai
  • Lightly
    Y Combinator S2021
    Active • 5 employees • Zürich, Switzerland
    When ML teams send their data to companies like Scale.ai for labeling, most can only afford to label 1% or less of their datasets. But today they don’t have a good way to pick which 1% to label. We help them pick the best 1% of their data to label. By labeling the most representative data, they significantly improve model accuracy at the same cost.
    machine-learning
    data-labeling
  • Centaur
    Y Combinator W2019
    Active • 45 employees • Boston, MA, USA
    The best AI models aren’t just trained and evaluated with human data; they’re built with superhuman data. The strongest datasets emerge through collective intelligence, where humans and machines work together to outperform either one alone. At Centaur, we create superior quality data by turning annotation into an arena where experts and AI compete.
    data-labeling
    crowdsourcing
    data-science
    artificial-intelligence
  • Sepal AI
    Y Combinator S2024
    Acquired • 15 employees • San Francisco, CA, USA
    Sepal is a data research company on a mission to advance human knowledge and capabilities through safe AI. We partner with the world’s leading AI labs and enterprises to help their models get better at the tasks people actually want them to do. We’ve built a Cloud-Native Agent Dataset Factory which turns the process of generating evaluation and training data from manual, inconsistent, and labor-intensive into something automated, standardized, and scalable. Sepal AI was founded in 2024 by engineers and operators from Vercel and Turing. We went through Y Combinator, raised several million dollars from leading investors, and already count multiple Fortune 500s and top AI research labs as paying customers.
    data-labeling
    aiops
    reinforcement-learning
    ai
  • Deasy Labs
    Y Combinator S2023
    Acquired • 8 employees • New York City
    Deasy Labs was acquired by Collibra, a global leader in enterprise data governance, in July 2025. Deasy Labs provides metadata orchestration for AI workflows. The platform gives AI teams a way to create and embed high-quality, customized metadata into their AI workflows (e.g., RAG, agentic frameworks). Our three founders (from Amazon, McKinsey/QuantumBlack & MIT) previously built an ML data governance tool from 0 to 1 within McKinsey, which we deployed with 11 Fortune 500 companies. In early 2023 we saw that the ability to create high-quality metadata (without reliance on domain experts) would be a key factor in achieving the accuracy and speed that GenAI applications require in production. Our investors include General Catalyst, Y Combinator, RTP Global, and world experts in enterprise data. Website: https://deasylabs.com
    ai-assistant
    data-labeling
    databases
    big-data
    artificial-intelligence
  • JumpWire
    Y Combinator W2022
    Acquired • 2 employees • New York, NY, USA
    JumpWire is a data protection platform that adds advanced data security controls between APIs, applications, and databases. JumpWire automatically identifies sensitive properties inside large data sets and gives developers full control over which people and applications can access or update records containing sensitive info. Example uses include restricting reads of customer PII to members of the customer service team, giving on-call engineers elevated access to production, or splitting user records between regions for GDPR purposes. JumpWire’s approach of securing data in place minimizes the risk of data leaks exposing sensitive information and of mishandling by other applications and vendors. The exact security scheme applied to data is defined by policies that align with an organization’s existing InfoSec program. JumpWire helps companies that maintain information security through compliance programs such as SOC or HIPAA. These companies process sensitive data, often from their own customers, and exceed security best practices as a competitive advantage. JumpWire provides defense in depth for data and sits alongside access controls and Layer 4 encryption to provide a comprehensive data security solution. JumpWire differs from solutions such as data vaults in that it installs inside our customers’ own infrastructure and clouds. It is interoperable with existing applications and databases, which eliminates the need for large data migrations or code refactoring. Lower-level approaches to data security, such as encryption at rest, are too blunt and cannot differentiate between properties in the data itself. Their scope is limited to physical storage, and their protection is lost as soon as an application or query loads the data.
    security
    data-labeling
    databases